ABSTRACT

In  this  paper,  we  present  an  innovative  recursive  motion 
estimation  technique  that  can  take  advantage  of  the  in-depth 
resolution  (range)  to  perform  an  accurate  estimation  of 
objects  that  have  undergone  3-D  translational  and  rotational 
movements.  This  approach  iteratively  aims  at  minimizing 
the  error  between  the  object  in  the  current  frame  and  its 
compensated  object  using  estimated  motion  displacement 
from  the  previous  range  measurements.  In  addition,  in  order 
to  use  the  range  data  on  the  non-rectangular  grid  in  the 
Cartesian  coordinate,  we  consider  a  combination  of 
derivative  filters  and  the  transformation  between  the 
Cartesian  coordinates  and  the  sensor-centered  coordinates. 
For  sequences  of  moving  range  images  we  demonstrate  the 
effectiveness  of  the  proposed  scheme. 

Index  Terms —  3-D  motion  estimation,  range  image, 
object  tracking,  Ladar,  Laser  scanners 

1.  INTRODUCTION 

Classical  motion  estimation  techniques  in  computer  vision 
use  intensity  images  or  stereovision  to  estimate  3-D  motion 
parameters.  These  techniques  are  not  yet  sufficiently  robust 
to  be  used  for  highly  sensitive  real  time  systems.  Recently, 
with  the  rapid  progress  of  high-speed  range  camera 
technology,  capturing  what  is  referred  to  as  2.5-D  images  is 
becoming  possible.  These  images,  which  provide  precise 
measurements  of  geometry  of  the  3-D  environment,  can 
make  motion  estimation  and  object  tracking  much  easier  and 
more  reliable.  In  general,  there  are  two  classes  of  motion 
estimation  algorithms  for  range  images.  Class  one  is  for 
rigid motion surfaces [1]-[8], and the other is for moving
deformable  surfaces  [9],  [10].  Class  one  can  be  further 
divided  into  two  categories.  The  first  is  a  feature-based 
algorithm  [3],  [4],  whose  performance  depends  on  the 
detection  of  reliable  range  image  features  and  the 
establishment  of  interframe  correspondence  among  them. 
The  other  is  a  direct  area-based  algorithm  [1],  [2],  [5]-[8], 
which  is  more  straightforward  than  the  feature-based 
algorithm.  In  our  approach,  which  falls  in  this  category,  we 
are  mainly  concerned  with  rigid  motion  where  the  structure 
of  moving  images  is  based  on  a  single  beam  laser  scanner 
technology.  In  this  technology  a  deflection  mirror  assembly 
scans  a  beam  over  the  scene.  This  type  of  technique  has 
been widely used for many tactical and industrial
applications and uses different types of range measurement
technologies.  One  example  is  the  Time  of  Flight  (Pulsed) 
laser  range  modules,  which  send  short  pulses  that  are 
reflected  by  surrounding  objects.  Note  that  with  this 
technology,  a  three-dimensional  scan  of  a  scene  is  obtained 
by  deflecting  the  laser  beam  in  equal  increments  of  angle  in 
horizontal  and  vertical  planes.  A  scanned  scene  can  then  be 
represented in terms of range $\rho$, horizontal angle $\theta$, and
elevation angle $\phi$, which corresponds to a spherical (polar)
coordinate system.

By  converting  a  range  image  from  the  spherical  coordinate 
system  to  a  so-called  Cartesian  Elevation  Map  (CEM),  Horn 
and  Harris  [1]  developed  a  recovery  system  for  the  six 
degrees  of  freedom  of  motion  of  a  vehicle,  which  has  been  a 
challenging  problem  in  autonomous  navigation.  In  CEM  the 
depth  Z  is  expressed  as  a  function  of  X  and  Y,  which 
corresponds  to  displacements  in  the  horizontal  plane.  This 
time  varying  CEM  is  used  to  estimate  translational  and 
rotational  movements  of  rigid  objects. 

Although  the  optimized  solution  offered  by  Horn  and  Harris 
has  been  very  effective,  it  does  not  always  produce  very 
accurate  estimation  of  3-D  motion  displacements,  which  is 
crucial  for  highly  sensitive  robotic  operations.  Thus,  here  we 
present  a  recursive  approach  to  enhance  estimation 
accuracy.  As  will  be  described  next,  this  iterative  approach 
is  based  on  minimizing  the  error  between  the  new  position 
of  the  object  and  its  previous  location,  after  being 
compensated  using  estimated  motion  displacements.  In 
addition,  since  a  set  of  3-D  points  obtained  in  the  CEM 
coordinate  may  not  be  placed  regularly  on  a  rectangular 
grid,  we  present  a  method  that  uses  a  non-rectangular  grid  to 
reconstruct  the  displaced  frame.  This  scheme  employs 
derivative  filters  together  with  transformation  between  the 
Cartesian  coordinates  and  sensor-centered  coordinates  for 
image  reconstruction. 

2.  3-D  RIGID  MOTION  ESTIMATION 

Recovery  of  the  six  degrees  of  freedom  of  motion 
displacement  can  be  best  accomplished  by  using  time 
varying  CEM,  as  proposed  by  Horn  and  Harris  [1].  Their 
algorithm  is  based  on  the  assumption  that  most  of  the 
surface  is  smooth  so  that  local  tangent  planes  can  be 
constructed.  In  addition,  the  motion  between  frames  is 
smaller than the size of most features in the range image.
Furthermore, the environment is a single rigid assemblage
and only the motion of the sensor relative to the
environment  has  to  be  recovered. 

A time varying CEM can be expressed as $Z(X, Y, t)$, where $t$
denotes  time,  Z  is  the  depth,  and  X  and  Y  are  displacements 
in  the  horizontal  and  vertical  plane,  respectively.  For  a  rigid 
motion  scene,  the  motion  can  be  described  as  instantaneous 
translational  velocity  and  instantaneous  angular  velocity. 
For  every  3-D  point,  an  elevation  rate  constraint  equation 
relating  derivatives  of  X,  Y,  Z  can  be  obtained  as  [1], 

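$$\dot{Z} = p\dot{X} + q\dot{Y} + Z_t, \tag{1}$$

where $p = \partial Z/\partial X$, $q = \partial Z/\partial Y$, $Z_t = \partial Z/\partial t$,
$\dot{X} = dX/dt$, $\dot{Y} = dY/dt$, and $\dot{Z} = dZ/dt$.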

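For the vector to a point on the surface, $R = (X, Y, Z)^T$, we have

$$dR/dt = -t - \omega \times R, \tag{2}$$

where $t = [U \; V \; W]^T$ is the translational velocity and
$\omega = [A \; B \; C]^T$ is the rotational velocity. In component form,

$$\begin{aligned}
\dot{X} &= -U - BZ + CY, \\
\dot{Y} &= -V - CX + AZ, \\
\dot{Z} &= -W - AY + BX.
\end{aligned} \tag{3}$$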

From (1) and (3),

$$pU + qV - W + rA + sB + tC = Z_t, \tag{4}$$

where $r = -Y - qZ$, $s = X + pZ$, and $t = qX - pY$.
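
As a quick numerical check of (4), the sketch below evaluates the elevation rate in two ways: by substituting the rigid-motion velocities (3) into (1), and by using the combined form (4) directly. The function names and the sample point, slopes, and motion values are hypothetical and chosen only for illustration.

```python
def rigid_velocity(point, trans, rot):
    """Velocity (Xdot, Ydot, Zdot) of a surface point, following eq. (3)."""
    X, Y, Z = point
    U, V, W = trans
    A, B, C = rot
    return (-U - B * Z + C * Y,
            -V - C * X + A * Z,
            -W - A * Y + B * X)


def elevation_rate(point, slopes, trans, rot):
    """Z_t from the combined constraint, eq. (4)."""
    X, Y, Z = point
    p, q = slopes
    U, V, W = trans
    A, B, C = rot
    r = -Y - q * Z
    s = X + p * Z
    t = q * X - p * Y
    return p * U + q * V - W + r * A + s * B + t * C


# Hypothetical sample values (illustration only).
point = (0.5, -0.2, 3.0)         # X, Y, Z
slopes = (0.1, -0.3)             # p = dZ/dX, q = dZ/dY
trans = (0.02, -0.01, 0.05)      # U, V, W
rot = (0.001, -0.002, 0.003)     # A, B, C

Xd, Yd, Zd = rigid_velocity(point, trans, rot)
# Eq. (1) rearranged: Z_t = Zdot - p*Xdot - q*Ydot
zt_from_1_and_3 = Zd - slopes[0] * Xd - slopes[1] * Yd
zt_from_4 = elevation_rate(point, slopes, trans, rot)
print(zt_from_1_and_3, zt_from_4)
```

The two values agree up to floating-point rounding, which confirms the coefficients $r$, $s$, and $t$.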

Let us assume that there is a set of $m$ pixels in the image, and
for the $n$-th pixel we define the following six-dimensional
vectors:

$$\Phi_n = [p_n \; q_n \; {-1} \; r_n \; s_n \; t_n]^T, \qquad D = [U \; V \; W \; A \; B \; C]^T.$$

From (4), the rate of change of elevation $(Z_t)$ at pixel $n$ can
be written as

$$(Z_t)_n = \Phi_n^T D. \tag{5}$$

Based  on  the  above  equation  we  can  estimate  the  motion 
iteratively,  where  at  each  iteration  the  previous  estimate  is 
used in the process. Let us assume that in this process two
consecutive video frames (generated at a fixed frame rate)
are used to measure the rate of change of elevation. After
each  iteration  the  estimated  motion  vectors  are  used  to 
reconstruct  the  compensated  first  frame  for  the  next 
iteration. 

From (5) we can write

$$\text{noise} = y_n - \Phi_n^T (D - D^{i-1}), \tag{6}$$

where $y_n$ is the measurement of the displaced frame
difference (DFD) at pixel $n$ between the second frame and the
compensated first frame (i.e., the estimated second frame)
using the estimated motion vectors $D^{i-1}$ of the previous
iteration [13].

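For a cluster of $m$ moving pels, minimizing the squared
error in (6),

$$\min_{D} \sum_{n=1}^{m} \left[ y_n - \Phi_n^T (D - D^{i-1}) \right]^2, \tag{7}$$

gives the least-squares estimate of $D$ at iteration $i$:

$$D^i = D^{i-1} + \left[ \sum_{n=1}^{m} \Phi_n \Phi_n^T \right]^{-1} \sum_{n=1}^{m} \Phi_n y_n. \tag{8}$$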
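
A minimal sketch of the update (8) follows, using only the Python standard library. It is an illustration under stated assumptions, not the implementation used in the experiments: the helper names (`solve`, `ls_update`, `phi`), the quadratic test surface, and the motion values are hypothetical. In the scheme described above, $y_n$ would be measured from the DFD of the compensated frame; here it is synthesized from the linear model (6) with zero noise, so the estimate is expected to settle after the first update.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ls_update(D_prev, phis, ys):
    """One step of (8): D_i = D_{i-1} + (sum phi phi^T)^-1 (sum phi y)."""
    k = len(D_prev)
    N = [[sum(ph[a] * ph[b] for ph in phis) for b in range(k)]
         for a in range(k)]
    g = [sum(ph[a] * y for ph, y in zip(phis, ys)) for a in range(k)]
    delta = solve(N, g)
    return [d + u for d, u in zip(D_prev, delta)]


def phi(X, Y):
    """Phi_n on the hypothetical surface Z = 5 + 0.3X^2 + 0.2XY + 0.1Y^2."""
    Z = 5 + 0.3 * X * X + 0.2 * X * Y + 0.1 * Y * Y
    p = 0.6 * X + 0.2 * Y          # dZ/dX
    q = 0.2 * X + 0.2 * Y          # dZ/dY
    return [p, q, -1.0, -Y - q * Z, X + p * Z, q * X - p * Y]


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
phis = [phi(X, Y) for X in grid for Y in grid]
D_true = [0.02, -0.01, 0.05, 0.001, -0.002, 0.003]   # hypothetical motion

D = [0.0] * 6                      # initial estimate D^0 = 0
for i in range(3):
    # Synthetic DFD from the linear model (6), noise-free.
    ys = [sum(a * (t - d) for a, t, d in zip(ph, D_true, D)) for ph in phis]
    D = ls_update(D, phis, ys)
print(D)
```

With real range data the residual $y_n$ is only approximately linear in the remaining motion, which is why repeating the update after each compensation can improve on the single-step estimate.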


In  order  to  obtain  the  new  position  of  each  displaced  pixel 
on  a  non-rectangular  grid  in  CEM,  we  developed  a 
combination  of  derivative  filters  [12]  and  transformation 
between  the  Cartesian  coordinates  and  the  sensor-centered 
coordinates  in  a  non-rectangular  grid  coordinate. 

To  use  the  range  data  on  the  non-rectangular  sensor  grid 
directly  for  motion  estimation,  a  new  version  of  the  range 
flow  constraint  equation  is  derived  in  [12].  The  three 
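components of the motion vector for one point (i.e., in the X, Y,
and Z directions) can be written as

$$\begin{aligned}
\dot{X} &= X_x \dot{x} + X_y \dot{y} + X_t, \\
\dot{Y} &= Y_x \dot{x} + Y_y \dot{y} + Y_t, \\
\dot{Z} &= Z_x \dot{x} + Z_y \dot{y} + Z_t,
\end{aligned} \tag{9}$$

where $X_x = \partial X/\partial x$, $X_y = \partial X/\partial y$, $X_t = \partial X/\partial t$,
$Y_x = \partial Y/\partial x$, $Y_y = \partial Y/\partial y$, $Y_t = \partial Y/\partial t$,
$Z_x = \partial Z/\partial x$, $Z_y = \partial Z/\partial y$, $Z_t = \partial Z/\partial t$,
$\dot{x} = dx/dt$, $\dot{y} = dy/dt$, and $x$, $y$ are the sensor grid
(range image) indices.

Eliminating $\dot{x}$ and $\dot{y}$ from (9) and comparing the result with
equation (1) gives

$$\begin{aligned}
p &= \frac{Y_y Z_x - Y_x Z_y}{X_x Y_y - X_y Y_x}, \\
q &= \frac{X_x Z_y - X_y Z_x}{X_x Y_y - X_y Y_x}, \\
Z_t &= \frac{X_x Y_y Z_t + X_y Y_t Z_x + X_t Y_x Z_y - X_y Y_x Z_t - X_x Y_t Z_y - X_t Y_y Z_x}{X_x Y_y - X_y Y_x},
\end{aligned} \tag{10}$$

where $Z_t$ on the left-hand side is the CEM elevation rate of (1)
and the derivatives on the right-hand side are taken on the
sensor grid.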
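
The following sketch evaluates (10) for a single pixel from its nine sensor-grid derivatives. The function name `cem_derivatives` and the sample numbers are hypothetical; how the derivatives themselves are obtained (the derivative filters of [12]) is outside the scope of this sketch.

```python
def cem_derivatives(Xx, Xy, Xt, Yx, Yy, Yt, Zx, Zy, Zt):
    """Return (p, q, Z_t of the CEM) from sensor-grid derivatives, eq. (10)."""
    J = Xx * Yy - Xy * Yx          # common denominator in (10)
    if J == 0:
        raise ValueError("degenerate grid: X_x*Y_y - X_y*Y_x is zero")
    p = (Yy * Zx - Yx * Zy) / J
    q = (Xx * Zy - Xy * Zx) / J
    zt_cem = (Xx * Yy * Zt + Xy * Yt * Zx + Xt * Yx * Zy
              - Xy * Yx * Zt - Xx * Yt * Zy - Xt * Yy * Zx) / J
    return p, q, zt_cem


# Special case: a static rectangular grid with spacing 0.5 in X and 0.25 in Y.
# Here (10) reduces to p = Z_x / dX, q = Z_y / dY, and Z_t is unchanged.
p, q, zt = cem_derivatives(0.5, 0.0, 0.0,
                           0.0, 0.25, 0.0,
                           0.1, -0.05, 0.02)
print(p, q, zt)
```

On a rectangular grid the expressions collapse to ordinary finite-difference slopes, which is a convenient sanity check before applying them to the non-rectangular sensor grid.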

In order to reconstruct the first frame after each iteration on a
non-rectangular grid, we perform motion compensation
directly in the spherical (polar) coordinates. This requires the
transformation between $(\rho, \theta, \phi)$ and $(X, Y, Z)$ each time
the motion vector estimate is updated. The transformation
from sensor-centered coordinates $(\rho, \theta, \phi)$ to Cartesian
coordinates $(X, Y, Z)$ is

$$\begin{aligned}
X &= \rho \sin\theta \cos\phi, \\
Y &= \rho \sin\phi, \\
Z &= \rho \cos\theta \cos\phi.
\end{aligned} \tag{11}$$

Similarly, from $(X, Y, Z)$ to $(\rho, \theta, \phi)$,

$$\begin{aligned}
\rho &= \sqrt{X^2 + Y^2 + Z^2}, \\
\theta &= \arctan(X / Z), \\
\phi &= \arctan\left(Y / \sqrt{X^2 + Z^2}\right).
\end{aligned} \tag{12}$$

Given the first frame $F_1$ and the estimated motion vector
$MV$, the estimated second frame will be

$$\begin{aligned}
X_2(x', y') &= X_1(x, y) + MV_X, \\
Y_2(x', y') &= Y_1(x, y) + MV_Y, \\
Z_2(x', y') &= Z_1(x, y) + MV_Z,
\end{aligned} \tag{13}$$

where $x$, $y$, $x'$, $y'$ are the image indices. For range data on a
rectangular grid, we can directly obtain $(x', y')$ as

$$\begin{aligned}
x' &= x + MV_X / \Delta X, \\
y' &= y + MV_Y / \Delta Y,
\end{aligned} \tag{14}$$

where $\Delta X = X(x+1, y) - X(x, y)$ and $\Delta Y = Y(x, y+1) - Y(x, y)$.

However,  since  the  3-D  range  data  in  the  X,  Y,  and  Z 
coordinate  system  are  not  on  the  rectangular  grid,  we  cannot 
directly  incorporate  the  motion  vector  to  reconstruct  the 
motion  compensated  frame.  At  the  same  time,  the  3-D 
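points in the sensor-centered coordinate $(\rho, \theta, \phi)$ system
have the property that $\Delta\theta$ and $\Delta\phi$ are constant, where
$\Delta\theta = \theta(x+1, y) - \theta(x, y)$ and $\Delta\phi = \phi(x, y+1) - \phi(x, y)$.

Therefore, each time the motion vector is estimated in the X,
Y, Z coordinates, motion compensation is performed in the
spherical coordinates, where

$$(X_1, Y_1, Z_1) \rightarrow (\rho_1, \theta_1, \phi_1), \qquad (X_2, Y_2, Z_2) \rightarrow (\rho_2, \theta_2, \phi_2).$$

Then we can obtain $(x', y')$ as

$$\begin{aligned}
x' &= x + (\theta_2 - \theta_1) / \Delta\theta, \\
y' &= y + (\phi_2 - \phi_1) / \Delta\phi.
\end{aligned} \tag{15}$$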
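
The compensation step for one point can be sketched as follows: displace the point in Cartesian coordinates, convert both positions to spherical coordinates, and read off the new grid index from the angular differences. The function names are hypothetical, and `atan2` is used in place of the plain arctangent as an assumption; the two agree when the scene lies in front of the sensor ($Z > 0$).

```python
import math


def to_cartesian(rho, theta, phi):
    """Sensor-centered (rho, theta, phi) to Cartesian (X, Y, Z)."""
    return (rho * math.sin(theta) * math.cos(phi),
            rho * math.sin(phi),
            rho * math.cos(theta) * math.cos(phi))


def to_spherical(X, Y, Z):
    """Cartesian (X, Y, Z) to sensor-centered (rho, theta, phi)."""
    rho = math.sqrt(X * X + Y * Y + Z * Z)
    theta = math.atan2(X, Z)                  # arctan(X / Z) for Z > 0
    phi = math.atan2(Y, math.hypot(X, Z))     # arctan(Y / sqrt(X^2 + Z^2))
    return rho, theta, phi


def compensated_index(x, y, point1, mv, d_theta, d_phi):
    """Displace point1 by mv and return (x', y', rho_2) on the angular grid."""
    point2 = tuple(c + d for c, d in zip(point1, mv))
    _, th1, ph1 = to_spherical(*point1)
    rho2, th2, ph2 = to_spherical(*point2)
    x_new = x + (th2 - th1) / d_theta
    y_new = y + (ph2 - ph1) / d_phi
    return x_new, y_new, rho2


# Hypothetical example: a point that moves by two angular steps horizontally.
p1 = to_cartesian(4.0, 0.10, 0.05)
p2 = to_cartesian(4.0, 0.12, 0.05)
mv = tuple(b - a for a, b in zip(p1, p2))
print(compensated_index(10, 20, p1, mv, 0.01, 0.01))
```

In general $(x', y')$ is not an integer, so the compensated frame has to be resampled onto the sensor grid; that interpolation step is not shown here.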

3.  RESULTS 

In  order  to  quantitatively  analyze  our  proposed  3-D  motion 
estimation  algorithm  we  have  synthetically  generated 
sequences  of  moving  range  images.  In  particular,  these 
moving  images  are  produced  in  such  a  way  that  a  3-D  object 
can  be  displaced  in  accordance  with  the  predefined  motion 
displacement  parameters.  These  images  can  allow  us  to 
evaluate  the  accuracy  of  estimated  motion  vectors  with 
reference  to  the  actual  displacement  parameters. 

Moving  range  image  sequences  were  constructed  via  3-D 
OOGL  (Object  Oriented  Graphics  Library)  files.  OOGL  is  a 
3-D  object  data  file  in  which  an  object  is  defined  by 
vertices,  lines  and  surfaces.  Fig.  1  shows  an  OOGL  file 
called “igea”, which was selected here to generate a range
video  sequence  for  our  simulation. 

The RIF file is a range image format that is based on the
Cartesian coordinates (X, Y, Z components) and consists of
the object points and the Mask map (which indicates where
there are object points). In this format frames with moving objects
are  constructed  by  first  displacing  the  object  in  the  OOGL 
file  and  then  transforming  it  to  the  RIF  format.  In  this  way 
we  can  create  a  sequence  of  moving  range  images  (frames) 
where  the  object  in  each  frame  can  be  displaced  by  a 
predefined  3-D  motion  vector.  In  order  to  assess  the 
performance  of  the  motion  estimation,  we  deliberately 
corrupted  the  second  range  image  with  zero  mean,  additive 
Gaussian  noise.  Different  levels  of  noise,  as  described  by 
the  standard  deviation,  are  added  to  the  range  component, 
$\rho$, in the spherical coordinate (before transformation to the
CEM coordinate).
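
One way to apply such range noise is sketched below. Perturbing $\rho$ while keeping the two angles fixed is equivalent to scaling the Cartesian point along its line of sight, so no explicit coordinate conversion is needed. The function name `add_range_noise` and the sample points are hypothetical, and the sketch assumes every point has nonzero range.

```python
import math
import random


def add_range_noise(points, sigma, rng):
    """Add zero-mean Gaussian noise (std sigma) to the range of each point.

    points: list of (X, Y, Z) tuples with nonzero range.
    rng:    a random.Random instance, so that runs are repeatable.
    """
    noisy = []
    for X, Y, Z in points:
        rho = math.sqrt(X * X + Y * Y + Z * Z)
        scale = (rho + rng.gauss(0.0, sigma)) / rho   # new range / old range
        noisy.append((X * scale, Y * scale, Z * scale))
    return noisy


points = [(0.5, -0.2, 3.0), (1.0, 0.4, 2.0)]          # illustration only
noisy = add_range_noise(points, 0.05, random.Random(1))
print(noisy)
```

Passing a seeded `random.Random` makes it straightforward to repeat a test several times with different seeds and average the results.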

Now  we  present  the  simulation  results  of  the  proposed 
motion  estimation  technique  in  accordance  with  equation 
(8).  From  this  equation  we  can  observe  that  for  i  =  1  (first 
iteration) and for the initial estimate $D^0 = 0$, (8) reduces to
the  Horn  and  Harris  algorithm  [1].  Therefore,  any 
improvement  after  the  first  iteration  is  credited  to  the 
proposed  recursive  method  over  the  Horn  and  Harris 
algorithm. Another factor affecting the performance of the
estimation method is dealing with the non-rectangular grid
typical of range images in the X, Y, and Z coordinate system.
As described in Section 2, we have developed a method
which is a combination of the derivative filter and
transformation between $(\rho, \theta, \phi)$ and (X, Y, Z).

We  use  two  criteria  as  a  measure  of  performance:  Mean 
Square  Error  (MSE)  and  Motion  Vector  Error  (MVE).  The 
MSE between Frame 1 and Frame 2 is defined as

$$\text{MSE} = \frac{1}{m} \sum_{R} \left[ (X_2 - X_1)^2 + (Y_2 - Y_1)^2 + (Z_2 - Z_1)^2 \right],$$

where $R$ is the region that combines both objects in the two
frames, $R = \text{MASK}_1 \cup \text{MASK}_2$, and $m$ is the number of points in
region $R$.

Given the true motion parameters $(U, V, W, A, B, C)$ and the
estimated ones $(\hat{U}, \hat{V}, \hat{W}, \hat{A}, \hat{B}, \hat{C})$, the MVE is defined as

$$\text{MVE} = \frac{|\hat{U} - U| + |\hat{V} - V| + |\hat{W} - W| + |\hat{A} - A| + |\hat{B} - B| + |\hat{C} - C|}{|U| + |V| + |W| + |A| + |B| + |C|}.$$
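
Both criteria are simple to compute; a minimal sketch is given below. The function names are hypothetical. The `mse` helper assumes the two frames are already given as equal-length lists of corresponding points over the region $R$; how points that belong to only one mask are filled in is not specified here and is left to the caller.

```python
def mse(frame1, frame2):
    """Mean square error between corresponding (X, Y, Z) points of two frames."""
    m = len(frame1)
    total = sum((a - b) ** 2
                for P1, P2 in zip(frame1, frame2)
                for a, b in zip(P1, P2))
    return total / m


def mve(true_d, est_d):
    """Motion vector error: summed absolute error relative to the true motion."""
    num = sum(abs(e - t) for e, t in zip(est_d, true_d))
    den = sum(abs(t) for t in true_d)
    return num / den


# Hypothetical values for illustration.
f1 = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
f2 = [(0.0, 0.0, 1.0), (1.0, 1.0, 3.0)]
print(mse(f1, f2))                                   # (1 + 4) / 2 = 2.5

true_d = [1.0, -1.0, 2.0, 0.0, 0.0, 0.0]             # U, V, W, A, B, C
est_d = [1.5, -1.0, 2.0, 0.0, 0.0, 0.5]
print(mve(true_d, est_d))                            # (0.5 + 0.5) / 4 = 0.25
```

Because the MVE is normalized by the true motion, it is undefined when all six true parameters are zero.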

In  our  experiments  we  set  the  maximum  number  of 
iterations  to  16.  However,  if  the  MSE  difference  between 
successive iterations is less than a threshold (0.1) and
the current MSE is larger than the previous one, the previous
estimate is selected and the iteration is stopped.
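
The stopping rule can be sketched as a small loop. The callback `mse_of_iteration` is a hypothetical stand-in for running one iteration and measuring its MSE, and the "MSE difference" is taken here as an absolute difference, which is an assumption.

```python
def select_iteration(mse_of_iteration, max_iter=16, threshold=0.1):
    """Return (chosen iteration, its MSE) under the stopping rule above.

    mse_of_iteration(i) is a hypothetical callback giving the MSE after
    iteration i (1-based).
    """
    prev = mse_of_iteration(1)
    for i in range(2, max_iter + 1):
        cur = mse_of_iteration(i)
        # Stop when the MSE has nearly settled and has started to rise.
        if abs(cur - prev) < threshold and cur > prev:
            return i - 1, prev
        prev = cur
    return max_iter, prev


# Hypothetical MSE sequence: it rises slightly at iteration 5.
mses = [10.0, 4.0, 1.0, 0.5, 0.52, 0.4]
print(select_iteration(lambda i: mses[i - 1], max_iter=6))   # (4, 0.5)
```

If the rule never triggers, the estimate of the last allowed iteration is returned.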


Fig. 1. The OOGL file “igea”.

We  carried  out  these  experiments  under  various  test 
conditions. For example, we used different parameters to
transform a 3-D image (see Fig. 1) from OOGL to RIF.
Based on the 3-D test image shown in Fig. 1, we created a
large number of range video sequences with different view
angles and different translational and rotational motions.

The  results  of  our  experiments  are  presented  subjectively 
and  objectively.  In  the  subjective  results  we  show  a 
difference  between  the  second  frame  and  the  estimated 
second  frame.  Note  that  the  estimated  second  frame 
corresponds  to  the  motion  compensated  first  frame  based  on 
the  estimated  motion  parameters  (e.g.,  after  each  iteration). 
This frame difference, as shown by equation (6) in Section
2, corresponds to the displaced frame difference (DFD). In
the objective results, we show the MSE curve and the MVE
curve for the recursive motion estimation algorithm. The
results for two consecutive frames of the “igea”
image are depicted in Figs. 2 and 3. It can be clearly
observed that the results of the first iteration, which
correspond to the Horn and Harris algorithm, are very poor.
This is mainly because the surfaces of some objects are not
smooth enough and there are many surfaces that are not
always conjoined smoothly. However, after the first
iteration,  due  to  the  proposed  recursive  motion  estimation 
algorithm,  the  estimated  motion  parameters  approach  the 
actual  motion  parameters. 

In  order  to  test  the  resistance  of  the  motion  estimation 
scheme to noise, we added different levels of synthetic
noise (white Gaussian noise) to the second range image. We
then  averaged  the  results  by  running  each  test  20  times.  The 
results  are  depicted  in  Fig.  4.  We  can  see  that  the 
performance  of  the  motion  estimation  drops  as  the  noise 
level  increases.  Nevertheless,  the  recursive  motion 
estimation  continues  to  maintain  its  gain  over  the  Horn  and 
Harris  algorithm. 

Finally,  we  should  point  out  that  for  every  iteration  the 
computational  cost  would  be  the  same  as  with  the  Horn  and 
Harris  algorithm,  except  that  additional  processing  would  be 
required to achieve transformation between $(\rho, \theta, \phi)$ and
(X, Y, Z) after each iteration.


Fig.  2.  Subjective  evaluations  of  the  proposed  motion  estimation  scheme 
for  “igea”.  (a)  The  first  image;  (b)  The  second  image;  (c)  The  estimated 
second  image  using  the  estimated  motion  parameters  of  final  iteration;  (d) 
The  difference  image  between  the  original  two  images;  (e)  The  difference 
image  between  the  second  image  and  the  estimated  second  image  (DFD) 
using  the  motion  parameters  of  the  first  iteration  (Horn  and  Harris 
algorithm);  (f)  The  difference  image  between  the  second  image  and  the 
estimated  second  image  (DFD)  using  the  motion  parameters  of  the  final 
iteration. 


Fig. 3. Objective evaluations of the proposed motion estimation scheme for
“igea”. (a) MSE; (b) MVE.


4.  CONCLUSIONS 

In the realm of 3-D measurements, high-resolution moving
range images that support accurate object tracking
and velocity estimation are required for highly
sensitive and critical operations. Thus, our main objective
has  been  to  improve  motion  estimation  accuracy  involving 
both  rotational  and  translational  movements.  We  have 
presented  a  recursive  motion  estimation  technique  that  can 
take advantage of the in-depth resolution (range). We have
shown that displacement of objects with complex 3-D
motion  in  range  images  can  be  accurately  estimated  by 
using  the  proposed  recursive  approach.  In  addition,  we 
presented  a  method  of  reconstructing  a  motion  compensated 
frame  in  a  non-rectangular  grid  structure  typical  of  range 
images  in  the  Cartesian  coordinate  system. 


Fig. 4. Results at different levels of noise. (a) MSE; (b) MVE.